I expect to become more fluent with R tools and the data science workflow. This course seems to be a good place to get an all-around grasp of the basic ways of working with data science tools.
https://github.com/juhopaak/IODS-project
date()
## [1] "Wed Dec 2 20:37:36 2020"
Let’s begin by reading the analysis dataset created in the data wrangling part, and exploring its structure and dimensions.
data <- read.csv("data/learning2014.csv")
dim(data)
## [1] 166 7
str(data)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The dataset has 166 observations and 7 variables. The variables are “gender”, “age”, “attitude”, “deep”, “stra”, “surf”, and “points”. The observations are student answers to the international survey of Approaches to Learning, collected on an introductory statistics course in 2014. The attitude variable is a score indicating the student’s attitude toward statistics, on a scale of 1-5. The deep, stra, and surf variables are average scores of answers to questions about different approaches to learning (deep, strategic, and surface learning), measured on the Likert scale. The points variable is an integer representing the student’s test score.
Let’s continue by looking at summaries of the variables and plotting the data.
summary(data)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
We can first see that the gender distribution includes almost twice as many females (110) as males (56).
The youngest students were 17 years old and the oldest 55. The 3rd quartile of age is 27, which means that three quarters of the students are between 17 and 27 years old. We can also see this in the variable’s histogram.
hist(data$age)
The attitude variable is distributed between 1.4 and 5, with half of the answers between 2.6 and 3.7. This variable’s distribution resembles the normal distribution, as displayed below.
hist(data$attitude)
The deep variable is distributed between 1.583 and 4.917, but the 1st quartile is 3.333, so most of the answers were between 3.333 and 4.917. In the stra variable we see a more even distribution between 1.250 and 5, but still the 1st quartile is 2.625, so most of the values are clustered between 2.625 and 3.625. The surf variable, on the other hand, seems to be somewhat skewed towards lower values, with distribution between 1.583 and 4.333 but the 1st quartile at 2.417. The histograms below confirm these observations.
par(mfrow =c(2,2))
hist(data$deep)
hist(data$stra)
hist(data$surf)
The test points are distributed between 7 and 33, with the 1st quartile at 19, the median at 23, and the 3rd quartile at 27.75, so half of the scores fall between 19 and 27.75. Most of the scores seem to be between 15 and 30, as shown below.
hist(data$points)
Let’s next look at a scatter diagram of the data, which displays the relationships between the different variables.
pairs(data[-1], col=data$gender)
From inspecting the scatter diagram it seems that the variables most strongly correlated with points are attitude, stra, and surf. Let’s try these three as explanatory variables in the linear model.
model <- lm(points ~ attitude+stra+surf, data = data)
summary(model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
From the coefficient estimates we can see that attitude predicts the value of points with an effect of 3.3952, whereas the effects of stra and surf are 0.8531 and -0.5861 respectively. This means that changes in attitude are more strongly associated with changes in the value of points. In fact, stra and surf are only weakly associated with points, which is also reflected in their p-values: neither variable has a statistically significant relationship (p < 0.05) with points. A large p-value means that values of the test statistic at least as extreme as the one observed would be likely even if the coefficient were unrelated to the target variable. The p-values of stra and surf thus mean that we have no strong evidence, on the basis of the data, to reject the null hypothesis that stra and surf have no effect on the values of the points variable.
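As a quick sketch of where the p-values come from: each one follows from the coefficient's t value and the residual degrees of freedom (162 in this model). Using the rounded values printed above for the stra coefficient:

```r
# Two-sided p-value from the t statistic, with 162 residual degrees of freedom
t_stra <- 1.575
p_stra <- 2 * pt(-abs(t_stra), df = 162)
p_stra  # close to the 0.11716 printed in the summary
```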
Let’s fit the model again with just the attitude as explanatory variable.
model <- lm(points ~ attitude, data = data)
summary(model)
##
## Call:
## lm(formula = points ~ attitude, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
In this model, the coefficient estimates show that attitude predicts the value of points with the effect of 3.5255, which is slightly stronger than in the previous model. Also the standard error of the coefficient is slightly smaller, and the p-value is notably smaller. The results show that attitude is a highly significant predictor of points.
The multiple R-squared gives the proportion of variation in the target variable points that is explained by the model’s explanatory variables, here attitude alone. We can see that attitude alone explains around 19% of the variation in the points variable.
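The same quantity can be computed by hand as 1 - SS_res / SS_tot. A self-contained sketch on simulated data (the variables here are made up for illustration, not taken from the course dataset):

```r
set.seed(1)
x <- rnorm(100)
y <- 3.5 * x + rnorm(100, sd = 5)   # simulated predictor and response
m <- lm(y ~ x)
ss_res <- sum(residuals(m)^2)       # residual sum of squares
ss_tot <- sum((y - mean(y))^2)      # total sum of squares
r2_manual <- 1 - ss_res / ss_tot
all.equal(r2_manual, summary(m)$r.squared)  # TRUE
```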
Finally, let’s examine diagnostic plots of the model.
par(mfrow =c(2,2))
plot(model, which=c(1,2,5))
The model is based on four important assumptions: linearity, constant variance, normality of residuals, and independence of errors. The plots above can be used to examine the constant variance and normality assumptions, as well as the influence of individual observations.
The constant variance assumption holds that the size of the model’s errors should not depend on the explanatory variable. This assumption can be validated by plotting the residuals (model errors) against fitted values. The assumption is validated if the resulting scatterplot is reasonably evenly distributed. Indeed, we can see from the residuals vs. fitted values plot above that this is the case.
The normality assumption holds that the model’s errors are normally distributed. This assumption can be validated by plotting the residuals against values drawn from the normal distribution. If these values fall roughly along the diagonal line in the Normal Q-Q plot, then the model’s errors are approximately normally distributed.
The residuals vs. leverage plot helps assess whether some observations have a large influence on the estimation of the model’s parameters. If some observations have high leverage, then excluding them from the analysis has a large influence on the model. In the plot, points falling within the areas delimited by the red dashed line (Cook’s distance) correspond to influential observations. It seems that in this model there are no highly influential observations.
Taken together, these plots suggest that the model’s assumptions are reasonably valid.
date()
## [1] "Wed Dec 2 20:37:37 2020"
Let’s begin by reading the joined dataset from local directory.
data <- read.csv("data/alc.csv")
dim(data)
## [1] 382 35
As displayed above, the data includes 382 observations and 35 variables. The data are about student achievement in secondary education of two Portuguese schools, collected by using school reports and questionnaires. The data used in this analysis is a combination of two datasets from distinct subjects: Mathematics and Portuguese language. The dataset includes the following variables.
colnames(data)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
In this analysis we will study the relationship between high/low alcohol consumption of the students (the variable “high_use”) and four other variables in the data. As the four explanatory variables, let’s select age, studytime, failures, and famrel.
My personal hypothesis is that all these variables will be significantly related to the students’ volume of alcohol consumption. In particular, I expect that (1) the older the student and the more time they use for studying, the less likely they are to consume high volumes of alcohol; and (2) students who have had many class failures and have a low quality of family relationships are more likely to consume a lot of alcohol.
Let’s next explore the distributions of these four variables and their relationship with alcohol consumption.
Let’s first look at bar charts of the four explanatory variables.
par(mfrow =c(2,2))
hist(data$age)
hist(data$studytime)
hist(data$failures)
hist(data$famrel)
It seems that most of the students are between ages 15-18, and that most of them use less than 5 hours a week for studying. Further, most of them have had no class failures, and their family relations are generally very good.
If we then look at the summary and a boxplot of the students’ alcohol use, we see that the means of their answers to daily and weekend alcohol use are distributed between 1 and 5, but the majority of answers are between 1 and 2.5. In fact, it seems that the mean score of 5 is an outlier, as we can see from the boxplot.
summary(data$alc_use)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.500 1.889 2.500 5.000
boxplot(data$alc_use)
If we then look at cross-tabulations of the four explanatory variables against the logical variable representing high/low alcohol use, we see that my initial hypotheses are unlikely to hold, at least without qualification.
First, from the table of age against alcohol use, we see that the highest ratios of high to low alcohol use are in the age groups above 16. This means that the older students consume more alcohol than the younger students, contrary to what I expected.
table(data$age, data$high_use)
##
## FALSE TRUE
## 15 63 18
## 16 79 28
## 17 64 36
## 18 54 27
## 19 7 4
## 20 1 0
## 22 0 1
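These ratios are easier to compare as row-wise proportions. A small sketch with the printed counts entered by hand (ages 20 and 22 omitted, as they have only one student each):

```r
# Counts copied from the age vs. high_use table above
counts <- matrix(c(63, 18,
                   79, 28,
                   64, 36,
                   54, 27,
                    7,  4),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(age = 15:19, high_use = c("FALSE", "TRUE")))
# Share of high users within each age group
round(prop.table(counts, margin = 1)[, "TRUE"], 2)
```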
From the table of study time against alcohol use, we see that indeed students who study less than 5 hours a week seem to consume more alcohol than those who study more.
table(data$studytime, data$high_use)
##
## FALSE TRUE
## 1 58 42
## 2 135 60
## 3 52 8
## 4 23 4
However, when tabulating past failures against alcohol use, we see that most of the students who have high alcohol consumption have no past failures. This is likely due to the fact that so few students in the data have past failures in the first place.
table(data$failures, data$high_use)
##
## FALSE TRUE
## 0 244 90
## 1 12 12
## 2 10 9
## 3 2 3
Finally, from the table of family relations against alcohol use, we see that most of the students with high alcohol consumption have good family relations - again likely due to most of the students in the data having good relations.
table(data$famrel, data$high_use)
##
## FALSE TRUE
## 1 6 2
## 2 10 9
## 3 39 25
## 4 135 54
## 5 78 24
Let’s now explore the relationship between these variables using logistic regression. The call glm() below fits the model to the alc data. The summary of the model is given after.
model <- glm(high_use ~ age + studytime + failures + famrel, data = data, family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ age + studytime + failures + famrel,
## family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4449 -0.8466 -0.6790 1.2448 2.2240
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1233 1.7523 -1.212 0.225624
## age 0.1961 0.1023 1.916 0.055335 .
## studytime -0.5466 0.1584 -3.451 0.000559 ***
## failures 0.2855 0.1928 1.480 0.138767
## famrel -0.2541 0.1265 -2.009 0.044537 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 436.65 on 377 degrees of freedom
## AIC: 446.65
##
## Number of Fisher Scoring iterations: 4
From the results we can see that two variables in the model are significant predictors of high alcohol use, namely studytime and famrel. Of these, studytime is a highly significant predictor. The coefficient estimates of both variables are negative. This means that students who study more are less likely to consume high volumes of alcohol, and similarly students who have good family relations are less likely to drink heavily. Each one-unit increase in studytime decreases the log odds of high alcohol consumption by 0.5466, and each one-unit increase in famrel decreases the log odds by 0.2541.
The results also show that there is no strong evidence to reject the null hypothesis that the students’ age and past class failures are unrelated to their high/low alcohol consumption.
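To make the log-odds scale concrete, we can plug the printed coefficients into the inverse logit by hand. A sketch for a hypothetical 17-year-old student with no past failures and famrel = 4, varying only study time (in practice predict() on the fitted model is the proper tool):

```r
# plogis() is the inverse logit: it maps log odds to probabilities.
# Coefficients below are the rounded estimates from the summary above.
p_high <- plogis(-2.1233 + 0.1961 * 17 - 0.5466 * (1:4) +
                   0.2855 * 0 - 0.2541 * 4)
round(p_high, 2)  # probability of high use drops as study time grows
```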
Let’s now compute the odds ratios (OR in the below table) of the model coefficients, and provide confidence intervals for them.
# Odds ratios (OR) from model coefficients
OR <- exp(coef(model))
# Confidence intervals (CI)
CI <- exp(confint(model))
## Waiting for profiling to be done...
# Tabulate results
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1196357 0.003741489 3.6605851
## age 1.2166121 0.997143320 1.4906891
## studytime 0.5789116 0.420249609 0.7831056
## failures 1.3303606 0.911465723 1.9525463
## famrel 0.7756136 0.604403440 0.9941132
The coefficient of studytime has an odds ratio of ~0.58, which means that each one-unit increase in study time multiplies the odds of high alcohol consumption by roughly 0.58, almost halving them. In the case of famrel, the odds ratio is ~0.78, so family relations are not as strongly associated with high consumption as study time, but students with good family relations are still more likely to have low alcohol consumption than students with poor relations. We can also see that student age and past failures are positively related to alcohol consumption, but since these variables did not have a significant relationship in the model, we cannot take this as evidence that a relationship actually exists in the data.
The confidence intervals give the range of values within which the strength of each relationship is likely to fall, with 95% confidence. So if we computed a confidence interval for e.g. the strength of the relationship between studytime and high/low alcohol use 100 times on new samples, about 95 of the intervals would contain the true value.
For the coefficient estimate of studytime, the 95% confidence interval of the odds ratio is (0.42, 0.78), and for famrel (0.60, 0.99). Both of these ranges preserve the direction of the relationship. However, for the failures variable the interval is (0.91, 1.95). First, this is quite wide, and second, since it spans 1, we cannot say with 95% confidence whether having many failures increases or decreases the odds of high alcohol consumption. The same is true of the age variable, although its interval is not as wide.
These results contradict my initial hypotheses in that age and past failures were not significantly related to high alcohol consumption. Study time and family relations, however, were significantly related, as I expected. Further, study time was a highly significant predictor.
Let’s now evaluate the predictive power of these significant explanatory variables. Let’s start by fitting a model with just these two variables.
model <- glm(high_use ~ studytime + famrel, data = data, family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ studytime + famrel, family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3749 -0.8225 -0.7364 1.3140 2.1048
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.3041 0.5700 2.288 0.022139 *
## studytime -0.5946 0.1529 -3.888 0.000101 ***
## famrel -0.2563 0.1243 -2.062 0.039207 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 444.08 on 379 degrees of freedom
## AIC: 450.08
##
## Number of Fisher Scoring iterations: 4
In this model, studytime seems to have an even stronger relationship with alcohol use, while the relationship between famrel and alcohol use remained nearly the same.
Let’s now predict the probability of high_use using this model, and tabulate the results.
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.4
## ✓ tibble 3.0.1 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
probabilities <- predict(model, type = "response")
# add the predicted probabilities to data
data <- mutate(data, probability = probabilities)
# use the probabilities to make a prediction of high_use
data <- mutate(data, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = data$high_use, prediction = data$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 265 3
## TRUE 109 5
From the confusion matrix we can see that the model correctly predicts 265 out of 268 cases of low alcohol use, but incorrectly predicts low use for 109 cases of actual high use. Only 5 out of the total of 114 cases of high use were predicted correctly. Let’s compute the training error to see how the model fares overall.
The following code defines the mean prediction error as the loss function and calculates it for the model’s predictions.
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = data$high_use, prob = data$probability)
## [1] 0.2931937
The mean prediction error is 0.2931937, which means that the model correctly predicts high/low alcohol consumption in about 71% of the cases in the data.
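The same training error can be read directly off the confusion matrix: misclassified cases divided by all cases.

```r
# Counts from the confusion matrix above
wrong <- 3 + 109                 # false positives + false negatives
total <- 265 + 3 + 109 + 5       # all 382 students
error_rate <- wrong / total
error_rate                       # 0.2931937, the same training error
```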
To end this exercise, let’s still do 10-fold cross-validation and compare the results with the model introduced in DataCamp (with error of 0.2617801).
# We'll use the package boot for performing cross-validation
library(boot)
# K=10 means 10 folds
cv <- cv.glm(data = data, cost = loss_func, glmfit = model, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2931937
My model has a higher average error under 10-fold cross-validation than the DataCamp model. Better than flipping a coin, though.
date()
## [1] "Wed Dec 2 20:37:40 2020"
Let’s load the Boston data and explore its dimensions. The dataset contains information collected by the U.S. Census Service, about housing in the Boston area.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The dataset has 506 observations and 14 variables. All the variables are numeric, and they are about phenomena such as crime rate per capita by town (crim), proportion of non-retail business acres per town (indus), and average number of rooms per dwelling (rm).
Let’s now summarize the variables in the data, and then look at a graphical overview of them.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
It seems that the first two variables (crim, the crime rate per capita, and zn, the proportion of residential land zoned for lots over 25,000 sq.ft) vary over quite a large range but are strongly skewed towards small values. The rest of the variables seem to be more evenly distributed between their minimum and maximum values.
Let’s now look at a graphical overview of the variables.
pairs(Boston)
It’s pretty hard to get a good overview of so many variables at a glance, so we’d need to do more detailed examinations of the relationships between them. Let’s use the correlation plot to first look at the relationships between all the variables, and then get a better look at the most interesting ones.
library(corrplot)
## corrplot 0.84 loaded
cor_matrix <- cor(Boston)
corrplot(cor_matrix, method="circle", type="upper")
It seems that crim is positively correlated with rad and tax, whereas the indus variable is positively correlated with nox and tax, but negatively correlated with dis. Let’s plot these variables against each other.
We’ll begin with crim, rad, and tax, which measure crime rate per capita by town, index of accessibility to radial highways, and full-value property tax rate per $10,000, respectively.
pairs(~crim+rad+tax,data=Boston)
High values of both rad and tax seem to go together with a high crime rate, but there’s a break in the relationship at mid-range values of these variables. Let’s next look at the indus variable against nox, tax, and dis. These variables measure the proportion of non-retail business acres per town (indus), nitric oxides concentration (nox), property tax (tax), and the distance to five Boston employment centres (dis).
pairs(~indus+nox+tax+dis, data=Boston)
Interestingly, the proportion of non-retail business acres per town indeed seems to grow with nitric oxides concentration. The positive relationship is less clear with property tax, while the negative relationship with distance to the Boston employment centres seems quite strong.
Let’s now standardize the dataset and create a categorical variable of crime rate per quantiles in order to predict it using linear discriminant analysis (LDA).
We’ll begin by scaling and summarizing the data again.
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
The variables now have zero means and their values have been normalized by the original data’s standard deviations. This means that the value distributions now tell us how much each variable varies around its mean in units of standard deviation, which makes the variables comparable against each other. For instance, the max value of the crim variable is now 9.924110, while the max of the tax variable is only 1.7964. The positive values of the tax variable are closer to the variable’s mean, so it has fewer outliers than crim. We can also see this from the boxplots of these variables.
par(mfrow =c(1,2))
boxplot(Boston$crim, xlab="Boston$crim")
boxplot(Boston$tax, xlab="Boston$tax")
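As an aside, scale() does nothing more mysterious than subtracting each column's mean and dividing by its standard deviation, which is easy to verify on a small vector:

```r
x <- c(2, 4, 6, 8)
by_hand <- (x - mean(x)) / sd(x)   # centering and scaling by hand
all.equal(as.vector(scale(x)), by_hand)  # TRUE
```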
Let’s change the scaled data into a data frame and create a categorical variable of crime rate using the quantiles as break points.
boston_scaled <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
# We'll label the values of the new variable "low", "med_low", "med_high", and "high"
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))
Finally, let’s replace the old crime variable in the scaled dataset with this new variable and tabulate its values.
boston_scaled$crim <- crime
table(boston_scaled$crim)
##
## low med_low med_high high
## 127 126 126 127
To evaluate our classification, let’s divide the data into sets used for training and testing the classifier.
library(dplyr)
# Randomly choose 80% of the observations and select these for the training set
n <- nrow(boston_scaled)
ind <- sample(n, size=n * 0.8)
train <- boston_scaled[ind,]
# Use the rest for the test set
test <- boston_scaled[-ind,]
Now we’re ready to fit the LDA model to the training data and evaluate it using the test data. Let’s also draw the biplot of the model.
# linear discriminant analysis
lda.fit <- lda(crim ~., data = train)
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crim)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
From the model results below we can see that the first dimension, which separates the high category from the rest, explains over 94% of the between-group variance (see the proportion of trace at the end of the summary). The second dimension seems to discriminate between the low, med_low, and med_high categories. However, most of the variance in the data seems to lie between the high category and the rest.
lda.fit
## Call:
## lda(crim ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2475248 0.2425743 0.2574257 0.2524752
##
## Group means:
## zn indus chas nox rm age
## low 0.93885337 -0.9388990 -0.07547406 -0.8820255 0.49972537 -0.9307269
## med_low -0.07903235 -0.3119140 -0.03128211 -0.5813763 -0.10663005 -0.3541088
## med_high -0.38746839 0.1468514 0.18195173 0.3522210 0.03511629 0.4007575
## high -0.48724019 1.0171096 -0.07933396 1.0766174 -0.45443134 0.8094780
## dis rad tax ptratio black lstat
## low 0.8508552 -0.6867152 -0.7568490 -0.45799471 0.37717202 -0.79361636
## med_low 0.3741421 -0.5470944 -0.4911557 -0.08079711 0.33019325 -0.16569908
## med_high -0.3603541 -0.4120549 -0.3079477 -0.22906729 0.08965041 0.05141099
## high -0.8548830 1.6382099 1.5141140 0.78087177 -0.82570214 0.91335566
## medv
## low 0.61815041
## med_low 0.01817895
## med_high 0.11080844
## high -0.74015865
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.12276148 0.73809690 -0.87512012
## indus 0.05245433 -0.26984601 0.47081179
## chas -0.07985937 -0.06293705 0.05428349
## nox 0.38227642 -0.55518934 -1.51694726
## rm -0.07305185 -0.08089938 -0.10227515
## age 0.25494274 -0.39890144 -0.07615446
## dis -0.05670001 -0.21248902 0.19357933
## rad 3.12622068 1.04909512 0.09301310
## tax 0.02903535 -0.13310625 0.46420733
## ptratio 0.10768771 0.08277360 -0.36360600
## black -0.11833632 0.01086220 0.16700413
## lstat 0.25821891 -0.26121090 0.28173819
## medv 0.17314091 -0.31126390 -0.30622264
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9495 0.0387 0.0118
Let’s now test the model on unseen data. First we’ll save the correct categories of the crime variable and remove them from the test data. Then we’ll predict the test data and display a confusion matrix of the results.
correct_classes <- test$crim
test <- dplyr::select(test, -crim)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 16 9 2 0
## med_low 4 16 8 0
## med_high 0 5 16 1
## high 0 0 0 25
Our classifier correctly predicts all 25 of the high values in the test data, and 16/22 of the med_high values. However, it fares worse on the low end, with 16/28 of the med_low values and 16/27 of the low values predicted correctly. This might be due to most of the variance in the data being between the high values and the rest.
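Per-class accuracy is the diagonal of the confusion matrix divided by the row sums. A sketch with the printed counts entered by hand:

```r
# Counts copied from the confusion matrix above
cm <- matrix(c(16,  9,  2,  0,
                4, 16,  8,  0,
                0,  5, 16,  1,
                0,  0,  0, 25),
             nrow = 4, byrow = TRUE,
             dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                             predicted = c("low", "med_low", "med_high", "high")))
round(diag(cm) / rowSums(cm), 2)  # accuracy within each true class
```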
To end this exercise we’ll reload the Boston data to do k-means clustering. We’ll begin by scaling the data again. Let’s also look at a summary of the data to see that everything is ok.
data("Boston")
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
Everything looks the same as before, so we can go on. Next we’ll calculate the Euclidean distances between the observations and run k-means (which uses Euclidean distances by default, so it does this calculation for us internally).
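As a quick sanity check on the distance measure (a toy example, not the Boston data), dist() with its default method returns plain Euclidean distances:

```r
# Two points, (0, 0) and (3, 4): their Euclidean distance should be exactly 5
m <- matrix(c(0, 0,
              3, 4), nrow = 2, byrow = TRUE)
as.numeric(dist(m))   # 5
```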
Let’s first try with 3 clusters.
dist_eu <- dist(boston_scaled)
km <- kmeans(boston_scaled, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
Again it’s difficult to get a good overview of so many variables. Let’s look in more detail at the last five.
pairs(boston_scaled[,10:14], col = km$cluster)
The model seems to capture some clusters of values in the data well. For instance, it correctly groups the high values of the tax variable into one cluster. However, it splits the low values across two different clusters, even though they lie quite close together, as can be seen from the plots.
Let’s now investigate the optimal number of clusters by examining the elbow plot, using the total within-cluster sum of squares (TWCSS) as the error measure.
set.seed(123)
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
plot(x = 1:k_max, y = twcss)
From the plot we see that the “elbow” is at two clusters, so we’ll rerun k-means with 2 clusters. Let’s again look at the last five variables.
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled[,10:14], col = km$cluster)
This model seems to do somewhat better at capturing the different clusters of values in the data. For instance, observations with low values of the tax variable are now much more clearly allocated to the same cluster. However, the model still cannot tell apart, for instance, observations with low values of black from those with high values of black. This might be due to those observations being more clearly separated from each other along some other variable.
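The pairwise scatterplots can be complemented by inspecting the cluster centers directly; since the data are standardized, each center value tells how far a cluster lies from the overall mean in standard deviations. A self-contained sketch that refits the 2-cluster model (the exact centers depend on the random initialization, so a seed is set):

```r
library(MASS)  # for the Boston data

boston_scaled <- scale(Boston)
set.seed(123)
km <- kmeans(boston_scaled, centers = 2)

# Centers of the variables plotted above, in standard-deviation units
round(km$centers[, c("tax", "ptratio", "black", "lstat", "medv")], 2)

# Number of observations in each cluster
km$size
```

A large positive center on tax for one cluster and a negative one for the other, for example, would confirm what the colored scatterplots suggest visually.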
date()
## [1] "Wed Dec 2 20:37:53 2020"
Let’s begin by loading the data and once more setting the countries as row names.
# Read data from local csv
human <- read.csv("data/human_reduced.csv", header=TRUE)
# Change countries as rownames and drop the redundant column
rownames(human) <- human$X
human <- human[2:ncol(human)]
Below are summaries of the variables in the data.
summary(human)
## edu_ratio part_ratio life_expect edu_expect
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## gni maternal_mortality adol_br repr_percent
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
The observations in the data correspond to different countries, with the following variables:
edu_ratio: Ratio of female to male population with secondary education. The values are distributed between 0.1717 and 1.4967, with the median at 0.9375 and the 3rd quartile at 0.9968. The minimum and maximum values thus seem to be outliers.
part_ratio: Ratio of female to male participation in the labour force. This variable is more evenly distributed than edu_ratio, between 0.1857 and 1.0380, though somewhat skewed towards the high end.
life_expect: Life expectancy at birth. This variable is skewed towards high values, with the minimum at 49, the maximum at 83.50, and the 1st quartile at 66.30.
edu_expect: Expected years of education. This variable seems quite evenly distributed between 5.40 and 20.20, although the minimum and maximum values might be outliers.
gni: Gross national income per capita. This variable ranges from 581 to 123124 and is very heavily skewed towards the low end, with the 3rd quartile only at 24512.
maternal_mortality: Maternal mortality ratio. Interestingly, the same pattern appears here, with the distribution skewed heavily towards low values.
adol_br: Adolescent birth rate. This variable is not as heavily skewed as the previous two, but still the minimum is 0.60 and the maximum 204.80, while the 3rd quartile is only 71.95.
repr_percent: Percentage of representation in parliament. The variable varies between 0 and 57.50, with the 3rd quartile at 27.95, so the values seem somewhat skewed towards the low end.
Let’s next look at a visual overview to confirm these observations and to see the relationships between the variables.
library(GGally)
ggpairs(human, progress=FALSE)
We can see from the plot that the distributions of edu_ratio, part_ratio, edu_expect, and repr_percent approximate the normal distribution, while the other variables are skewed towards either the low or the high end. There are many highly significant correlations between the variables: edu_ratio is significantly correlated with almost all the others; life_expect and edu_expect are significantly correlated with all other variables except part_ratio; gni is significantly correlated with all other variables except part_ratio and repr_percent. Finally, maternal_mortality is significantly correlated with all other variables except repr_percent, and adol_br with all other variables except part_ratio and repr_percent.
We can see these correlations more concisely in the correlation plot below. In fact, it seems that all variables except part_ratio and repr_percent are correlated either positively or negatively with the rest.
library(corrplot)
corrplot( cor(human) )
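The significance claims above come from the ggpairs output; individual correlations can also be tested directly with cor.test(). A toy sketch on synthetic data (not the human dataset):

```r
set.seed(42)
x <- rnorm(50)
y <- 2 * x + rnorm(50)   # strongly related by construction

# Pearson correlation with a t-test for the null hypothesis cor = 0
ct <- cor.test(x, y)
c(estimate = unname(ct$estimate), p.value = ct$p.value)
```

Running cor.test on a pair of columns of human (e.g. `cor.test(human$gni, human$life_expect)`) reproduces the stars shown by ggpairs.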
Let’s next reduce the dimensionality of the data with PCA to get a more manageable representation of these relationships. We’ll first try the technique on non-standardized data.
The code below first performs principal component analysis on the non-standardized data, then saves a summary of the results in the object s and calculates the percentage of variance explained by each principal component. These percentages are saved as labels in the object pc_lab. Finally, a biplot of the PCA results is drawn, using the labels of the first two components on the axes.
# Perform principal component analysis
pca_human <- prcomp(human)
# Save summary of the results, calculate percentage of variance explained, and save these as labels
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2,], digits = 1)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# Draw biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab=pc_lab[1], ylab=pc_lab[2])
PCA biplot 1: Gross National Income of countries vs. all other dimensions
We can see from the plot that, with the non-standardized data, the first principal component explains 100% of the variance and is perfectly correlated with the gni variable. This amounts to saying that gni alone explains all the variance in the data, which tells us nothing about the underlying dimensions of the dataset as a whole. PCA assumes that variables with larger variance are more important than those with smaller variance. As noted earlier, the gni variable has huge variance in the data, with a few outliers at the high end distorting the distribution. We can see this in the biplot too, where the length of the arrows representing the variables is proportional to their standard deviations. The arrow for gni is very long, which means that this variable carries a disproportionate weight in explaining the data. To get around this problem, let’s standardize the variables and repeat the analysis.
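Before repeating the analysis, the scale-dependence of PCA can be demonstrated on synthetic data (a toy sketch, not the human dataset): a single large-scale column soaks up essentially all of the variance until the data are standardized.

```r
set.seed(1)
toy <- data.frame(a   = rnorm(100),             # unit scale
                  b   = rnorm(100),             # unit scale
                  big = rnorm(100, sd = 1000))  # huge scale

# Unscaled: PC1 is dominated by 'big' and explains essentially all the variance
summary(prcomp(toy))$importance["Proportion of Variance", ]

# Scaled: the variance is shared far more evenly across the components
summary(prcomp(scale(toy)))$importance["Proportion of Variance", ]
```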
human_std <- scale(human)
pca_human <- prcomp(human_std)
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2,], digits = 1)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab=pc_lab[1], ylab=pc_lab[2])
PCA biplot 2: Development rate vs. political system of countries
With the variables standardized, the first two principal components now capture variance in the other variables of the data as well. The first component explains 53.6% of the variance and the second 16.2%. We can see from the biplot that the variables edu_expect, edu_ratio, gni, life_expect, maternal_mortality, and adol_br are highly correlated with the first principal component, whereas repr_percent and part_ratio are correlated with the second. Because the first two PCA dimensions are uncorrelated with each other, this matches our earlier observation that the latter two variables are not highly correlated with the rest.
The first principal component is negatively correlated with edu_expect, edu_ratio, gni, and life_expect, and positively correlated with maternal_mortality and adol_br. Variance along this dimension thus seems to capture differences between countries in life expectancy, education, and per capita income, which are also the component indicators of the Human Development Index (HDI). Loosely speaking, we could say that PC1 represents the difference between “developing” and “developed” countries (a problematic distinction), where countries with high values on the PC1 dimension have high rates of maternal mortality and adolescent births, and low education, life expectancy, and income indicators. Indeed, countries high on the PC1 dimension include Niger, Chad, Sierra Leone, and Mali, whereas countries low on the dimension include Korea, Japan, Switzerland, and Norway.
The second principal component is positively correlated with both repr_percent and part_ratio, and thus seems to capture differences in the political organization of the countries. This interpretation is somewhat problematic, however, as the part_ratio variable represents the ratio of female to male participation in the labour force, which is not straightforwardly an indicator of a country’s political system. Further, the two variables are not significantly correlated, as we saw earlier. A third principal component might therefore better represent the dimensionality of the data in terms of these remaining variables. Nevertheless, this dimension seems to say something meaningful about the countries in the data. In particular, countries low on the PC2 dimension include many countries from North Africa and the Middle East, while countries high on the dimension include the Nordic countries, along with countries from Southeastern Africa.
Finally, let’s turn to exploring the tea dataset from the FactoMineR package with Multiple Correspondence Analysis (MCA). The code below loads the data and selects a number of its columns for further exploration.
library(FactoMineR)
library(dplyr)
data("tea")
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, any_of(keep_columns))
Let’s begin by looking at the dimensions and structure of the data.
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
The dataset has 300 observations with the 6 variables we selected. All the variables are factors. Let’s next visualize the distributions of these variables.
library(ggplot2)
library(tidyverse)
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
The how variable records the method of preparing tea; the most common seems to be the tea bag, with over 150 observations. By contrast, the How variable represents different additions to tea, the most common being plain tea with nothing added. The lunch variable distinguishes whether tea is drunk at lunch, and the sugar variable whether it is drunk with or without sugar. Finally, the Tea variable records the tea type, and the where variable where the tea is bought.
Let’s turn to reducing the dimensionality of these data with MCA. The code below performs MCA on the data with the 6 variables, and gives a summary of the results.
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
From the Eigenvalues table we can see that the first dimension captured by MCA explains ~15.24% of the variance in the data, while the second explains ~14.23%. The variables how and where contribute most strongly to both the first and the second dimension, but more strongly to the first. The sugar variable contributes to the first dimension, while the lunch variable contributes to the second. Finally, the How variable contributes more strongly to the second dimension, while the Tea variable contributes more strongly to the first.
Let’s use the biplot to visualize these variables and their categories to get a better look at their relationships.
plot(mca, choix=c("var"))
From the plot we can see that the first dimension indeed seems to capture differences between observations in terms of the where and how variables, although these variables contribute strongly to the second dimension as well. Let’s look at the categories of the variables more closely to see how they relate to the dimensions and to each other.
plot(mca, invisible=c("ind"), habillage="quali")
From this biplot we can see that the first dimension seems to capture whether tea is drunk unpackaged or not, and whether it is bought at a tea shop or a chain store. This dimension thus captures variance in how and where tea is bought, with higher values corresponding to people who buy tea from specialized shops and prefer a wider selection than the prepackaged options available in chain stores. Indeed, the tea shop and unpackaged categories lie close to each other in the plot, as do the tea bag and chain store categories.
The second dimension seems to capture the “style” of tea consumption: what kind of tea people drink, and whether they like to add milk, lemon, or something else to the beverage. This is supported by the observation that whether people buy their tea from a chain store or a tea shop, and whether they prefer tea bags or unpackaged tea, do not seem to be distinguishing factors on this dimension: the categories of these variables all sit at the mid-to-high end of the dimension. People at the high end of the second dimension like to add milk, lemon, or “other” things to their tea, and perhaps drink their tea at lunch. People at the low end like to drink green tea without additions, and perhaps not at lunch.
date()
## [1] "Wed Dec 2 20:38:03 2020"
…
RATSL <- read.csv("data/RATSL.csv", header=T, stringsAsFactors = F)
RATSL$ID <- as.factor(RATSL$ID)
RATSL$Group <- as.factor(RATSL$Group)
library(dplyr)
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1,…
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1",…
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 47…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8,…
…
library(ggplot2)
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
…
# Standardise the variable Weight
RATSL <- RATSL %>%
group_by(Time) %>%
mutate(stdweight = (Weight - mean(Weight))/sd(Weight) ) %>%
ungroup()
…
# Plot again with the standardised Weight
ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized weight")
…
# Number of measurement occasions (time points), baseline included
n <- RATSL$Time %>% unique() %>% length()
# Summary data with mean and standard error of Weight by Group and Time
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n) ) %>%
ungroup()
# Plot the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
theme(legend.position = c(0.9,0.5)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
…
# Create summary data by Group and ID, with the mean as the summary variable (ignoring the baseline measurement)
RATSL64S <- RATSL %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
# Draw a boxplot of the mean versus treatment
ggplot(RATSL64S, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), weeks 2-11")
…
# Create a new dataset by filtering out the outliers, and adjust the ggplot code to draw the plot again with the new data
RATSL64S1 <- RATSL64S %>%
filter(mean > 250, mean < 590, mean < 475 | mean > 500)
# Draw a new boxplot
ggplot(RATSL64S1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), weeks 2-11")
…
# Cannot do two-sample t-test with three groups, so just doing analysis of variance on the linear model
# Add the baseline from the original data as a new variable to the summary data
RATS <- read.csv("data/RATS.csv", header=T, stringsAsFactors = F)
RATSL64S2 <- RATSL64S %>%
mutate(baseline = RATS$WD1)
# Fit the linear model with the mean as the response
fit <- lm(mean ~ baseline + Group, data = RATSL64S2)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1859.8201 1.57e-14 ***
## Group 2 879 439 3.2219 0.07586 .
## Residuals 12 1636 136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
…
BPRSL <- read.csv("data/BPRSL.csv", header=T, stringsAsFactors = F)
BPRSL$treatment <- as.factor(BPRSL$treatment)
BPRSL$subject <- as.factor(BPRSL$subject)
…
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week…
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66,…
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
…
treat1 <- BPRSL[BPRSL$treatment == 1,]
ggplot(treat1, aes(x = week, y = bprs, group = subject)) +
geom_line(aes(color=subject)) +
theme(legend.position = "top")
…
treat2 <- BPRSL[BPRSL$treatment == 2,]
ggplot(treat2, aes(x = week, y = bprs, group = subject)) +
geom_line(aes(color=subject)) +
theme(legend.position = "top")
…
# create a regression model BPRS_reg
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)
# print out a summary of the model
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
…
# access library lme4
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
# Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
…
# create a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
# perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
…
# create a random intercept and random slope model
BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
# perform an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
…
# Create a vector of the fitted values
Fitted <- fitted(BPRS_ref2)
# Create a new column fitted to BPRSL
BPRSL <- BPRSL %>%
mutate(Fitted)
…
treat1 <- BPRSL[BPRSL$treatment == 1,]
# draw the plot of BPRSL
ggplot(treat1, aes(x = week, y = Fitted, group = subject)) +
geom_line(aes(color = subject)) +
scale_x_continuous(name = "weeks", breaks = seq(0, 8, 1)) +
scale_y_continuous(name = "fitted bprs") +
theme(legend.position = "top")
…
treat2 <- BPRSL[BPRSL$treatment == 2,]
# draw the plot of BPRSL
ggplot(treat2, aes(x = week, y = Fitted, group = subject)) +
geom_line(aes(color = subject)) +
scale_x_continuous(name = "weeks", breaks = seq(0, 8, 1)) +
scale_y_continuous(name = "fitted bprs") +
theme(legend.position = "top")
…